25 research outputs found

    PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation

    Full text link
    High-performance computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and develop tools tailored to their needs. This article presents one simple technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL, two open-source toolkits that support this technique. In introducing PyCUDA and PyOpenCL, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems. The concept of RTCG is simple and easily implemented using existing, robust infrastructure. Nonetheless it is powerful enough to support (and encourage) the creation of custom application-specific tools by its users. The premise of the paper is illustrated by a wide range of examples where the technique has been applied with considerable success.Comment: Submitted to Parallel Computing, Elsevie

    Decoupled Vector-Fetch Architecture with a Scalarizing Compiler

    No full text
    As we approach the end of conventional technology scaling, computer architects are forced to incorporate specialized and heterogeneous accelerators into general-purpose processors for greater energy efficiency. Among the prominent accelerators that have recently become more popular are data-parallel processing units, such as classic vector units, SIMD units, and graphics processing units (GPUs). Surveying a wide range of data-parallel architectures and their parallel programming models and compilers reveals an opportunity to construct a new data-parallel machine that is highly performant and efficient, yet a favorable compiler target that maintains the same level of programmability as the others. In this thesis, I present the Hwacha decoupled vector-fetch architecture as the basis of a new data-parallel machine. I reason through the design decisions while describing its programming model, microarchitecture, and LLVM-based scalarizing compiler that efficiently maps OpenCL kernels to the architecture. The Hwacha vector unit is implemented in Chisel as an accelerator attached to a RISC-V Rocket control processor within the open-source Rocket Chip SoC generator. Using complete VLSI implementations of Hwacha, including a cache-coherent memory hierarchy in a commercial 28 nm process and simulated LPDDR3 DRAM modules, I quantify the area, performance, and energy consumption of the Hwacha accelerator. These numbers are then validated against an ARM Mali-T628 MP6 GPU, also built in a 28 nm process, using a set of OpenCL microbenchmarks compiled from the same source code with our custom compiler and ARM's stock OpenCL compiler

    Decoupled Vector-Fetch Architecture with a Scalarizing Compiler

    No full text
    As we approach the end of conventional technology scaling, computer architects are forced to incorporate specialized and heterogeneous accelerators into general-purpose processors for greater energy efficiency. Among the prominent accelerators that have recently become more popular are data-parallel processing units, such as classic vector units, SIMD units, and graphics processing units (GPUs). Surveying a wide range of data-parallel architectures and their parallel programming models and compilers reveals an opportunity to construct a new data-parallel machine that is highly performant and efficient, yet a favorable compiler target that maintains the same level of programmability as the others.In this thesis, I present the Hwacha decoupled vector-fetch architecture as the basis of a new data-parallel machine. I reason through the design decisions while describing its programming model, microarchitecture, and LLVM-based scalarizing compiler that efficiently maps OpenCL kernels to the architecture. The Hwacha vector unit is implemented in Chisel as an accelerator attached to a RISC-V Rocket control processor within the open-source Rocket Chip SoC generator. Using complete VLSI implementations of Hwacha, including a cache-coherent memory hierarchy in a commercial 28 nm process and simulated LPDDR3 DRAM modules, I quantify the area, performance, and energy consumption of the Hwacha accelerator. These numbers are then validated against an ARM Mali-T628 MP6 GPU, also built in a 28 nm process, using a set of OpenCL microbenchmarks compiled from the same source code with our custom compiler and ARM's stock OpenCL compiler

    Regulation of lubricin for functional cartilage tissue regeneration: a review

    Get PDF
    Abstract Background Lubricin is chondrocyte-secreted glycoprotein that primarily conducts boundary lubrication between joint surfaces. Besides its cytoprotective function and extracellular matrix (ECM) attachment, lubricin is recommended as a novel biotherapeutic protein that restore functional articular cartilage. Likewise, malfunction of lubrication in damaged articular cartilage caused by complex and multifaceted matter is a major concern in the field of cartilage tissue engineering. Main body Although a noticeable progress has been made toward cartilage tissue regeneration through numerous approaches such as autologous chondrocyte implantation, osteochondral grafts, and microfracture technique, the functionality of engineered cartilage is a challenge for complete reconstruction of cartilage. Thus, delicate modulation of lubricin along with cell/scaffold application will expand the research on cartilage tissue engineering. Conclusion In this review, we will discuss the empirical analysis of lubricin from fundamental interpretation to the practical design of gene expression regulation

    Convergence and Scalarization for Data-Parallel Architectures

    No full text
    Modern throughput processors such as GPUs achieve high performance and efficiency by exploiting data parallelism in application kernels expressed as threaded code. One drawback of this approach compared to conventional vector architectures is redundant execution of instructions that are common across multiple threads, resulting in energy inefficiency due to excess instruction dispatch, register file accesses, and memory operations. This paper proposes to alleviate these overheads while retaining the threaded programming model by automatically detecting the scalar operations and factoring them out of the parallel code. We have developed a scalarizing compiler that employs convergence and variance analyses to statically identify values and instructions that are invariant across multiple threads. Our compiler algorithms are effective at identifying convergent execution even in programs with arbitrary control flow, identifying two-thirds of the opportunity captured by a dynamic oracle. The compile-time analysis leads to a reduction in instructions dispatched by 29%, register file reads and writes by 31%, memory address counts by 47%, and data access counts by 38%

    Efficient, high-quality image contour detection

    No full text
    Image contour detection is fundamental to many image analysis applications, including image segmentation, object recognition and classification. However, highly accurate image contour detection algorithms are also very computationally intensive, which limits their applicability, even for offline batch processing. In this work, we examine efficient parallel algorithms for performing image contour detection, with particular attention paid to local image analysis as well as the generalized eigensolver used in Normalized Cuts. Combining these algorithms into a contour detector, along with careful implementation on highly parallel, commodity processors from Nvidia, our contour detector provides uncompromised contour accuracy, with an F-metric of 0.70 on the Berkeley Segmentation Dataset. Runtime is reduced from 4 minutes to 1.8 seconds. The efficiency gains we realize enable high-quality image contour detection on much larger images than previously practical, and the algorithms we propose are applicable to several image segmentation approaches. Efficient, scalable, yet highly accurate image contour detection will facilitate increased performance in many computer vision applications. 1

    Experimental assessment of the effect of frozen fringe thickness on frost heave

    No full text
    A frozen fringe plays a key role in frost heave development in soils. Previous studies have focused on the physical and mechanical properties of the frozen fringe, such as overall hydraulic conductivity, water content and pore pressure. It has been proposed that the thickness of the frozen fringe controls frost heave behavior, but this effect has not been thoroughly evaluated. This study used a temperature-controllable cell to investigate the impact of frozen fringe thickness on the characteristics of frost heave. A series of laboratory tests was performed with various temperature boundary conditions and specimen heights, revealing that: (1) the amount and rate of development of frost heave are dependent on the frozen fringe thickness; (2) the thicker the frozen fringe, the thinner the resulting ice lens; and (3) care must be taken when using the frost heave ratio to characterize frost heave and evaluate frost susceptibility because the frost heave ratio is not a normalized factor but a specimen height-dependent factor
    corecore